PLSC30500, Fall 2024

Part 4. Regression (part b)

Andy Eggers

The math behind OLS

Motivation

OLS estimator: \[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2\]

Could solve by trying every combination of \(b_0, b_1, \ldots, b_K\).



Let’s try to understand a more efficient solution: \[\hat{\boldsymbol{\beta}} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbf{Y}\]

The case of one predictor

\[(\hat{\beta_0}, \hat{\beta_1}) = \underset{(b_0, b_1) \in \mathbb{R}^{2}}{\arg\min} \, \frac{1}{n} \sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_i)\right)^2\]

Expanding the sum, \(S\), we get

\[\begin{align} S &= \sum_{i = 1}^n \left(Y_i^2 - 2\beta_0 Y_i - 2\beta_1 X_iY_i + \beta_0^2 + 2\beta_0 \beta_1 X_i + \beta_1^2 X_i^2\right) \\ &= \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2 \end{align}\]


We want to choose \(\beta_0\) and \(\beta_1\) to minimize \(S\).

Review: optimization via differentiation

How would you choose \(\beta_0\) to minimize \(S = -2\beta_0 + \beta_0^2\)?

Observing that the function is flat only at its minimum, we use differentiation to obtain the first order condition:

\[\frac{\partial S}{\partial \beta_0} = -2 + 2 \beta_0 = 0\]

So the solution is \(\beta_0 = 1\).
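To confirm that this flat point is a minimum rather than a maximum, check the second derivative:

\[\frac{\partial^2 S}{\partial \beta_0^2} = 2 > 0,\]

so \(S\) curves upward everywhere and the stationary point is the global minimum.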

Optimization in two dimensions

How would you choose \(\beta_0\) and \(\beta_1\) to minimize \(S = -2\beta_0 + \beta_0^2 - 3 \beta_1 + \beta_1^2 + \beta_0 \beta_1\)?

Taking the same approach, the first order conditions are

\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 + 2 \beta_0 + \beta_1 = 0 \\ \frac{\partial S}{\partial \beta_1} &= -3 + \beta_0 + 2\beta_1 = 0 \end{align}\]

A system of equations!

\[\begin{align} 2 \beta_0 + \beta_1 &= 2 \\ \beta_0 + 2\beta_1 &= 3 \end{align}\]

By isolating and substituting, we get \(\beta_0 = 1/3\) and \(\beta_1 = 4/3\).
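Spelling out those steps: isolate \(\beta_1\) in the first equation, substitute into the second, and solve:

\[\begin{align} \beta_1 &= 2 - 2\beta_0 \\ \beta_0 + 2(2 - 2\beta_0) &= 3 \\ -3\beta_0 &= -1 \\ \beta_0 &= 1/3, \qquad \beta_1 = 2 - 2(1/3) = 4/3 \end{align}\]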

Back to our problem

To find \(\beta_0\) and \(\beta_1\) that minimize \[S = \sum Y_i^2 - 2 \beta_0 \sum Y_i - 2\beta_1 \sum X_i Y_i + n \beta_0^2 + 2\beta_0 \beta_1 \sum X_i + \beta_1^2 \sum X_i^2,\] we use the same approach.

\[\begin{align} \frac{\partial S}{\partial \beta_0} &= -2 \sum Y_i + 2 n \beta_0 + 2\beta_1 \sum X_i &= 0\\ \frac{\partial S}{\partial \beta_1} &= -2 \sum X_i Y_i + 2 \beta_0 \sum X_i + 2 \beta_1 \sum X_i^2 &= 0 \end{align}\]

System of equations:

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

Solving our system of equations

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

Isolating \(\beta_0\) in the first equation, we get \[ \beta_0 = \frac{\sum Y_i}{n} - \beta_1 \frac{\sum X_i}{n}\]
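The substitution step, spelled out: plugging this expression for \(\beta_0\) into the second equation gives

\[\left(\frac{\sum Y_i}{n} - \beta_1 \frac{\sum X_i}{n}\right)\sum X_i + \beta_1 \sum X_i^2 = \sum X_i Y_i,\]

and collecting the \(\beta_1\) terms,

\[\beta_1 \left(\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}\right) = \sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}.\]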

Substituting into second equation, rearranging, dividing top & bottom by \(n\), we get \[\beta_1 = \frac{\frac{\sum X_iY_i}{n} - \frac{\sum X_i}{n}\frac{\sum Y_i}{n}}{\frac{\sum X_i^2}{n} - \left(\frac{\sum X_i}{n}\right)^2}\]

These are the sample analogues of the bivariate BLP (where \(\beta_1 = \frac{\text{Cov}[X,Y]}{{\textrm V}\,[X]}\)).

Where are we?

We have shown how, with a bit of calculus, we can derive the OLS estimator in the case with one predictor.

If we want \(K > 1\) predictors, could do the same thing:

  • expand the sum \(S\)
  • obtain \(K + 1\) first order conditions
  • solve by isolating and substituting

A horrible mess.

To make things easier, we turn to some linear algebra.

Matrix basics

A matrix

A matrix \(\mathbb{B}\) is a rectangular table of numbers, e.g. 

\[ \mathbb{B} = \begin{bmatrix} 1 & 2 \\ 4 & 0 \\ -1 & 3 \end{bmatrix}\]

This is a \(3\times 2\) matrix (rows \(\times\) columns).


The transpose of \(\mathbb{B}\), written \(\mathbb{B}^T\) or \(\mathbb{B}'\), is

\[ \mathbb{B}^T = \begin{bmatrix} 1 & 4 & -1\\ 2 & 0 & 3 \end{bmatrix}.\]

The \(k\)th row of \(\mathbb{B}\) is the \(k\)th column of \(\mathbb{B}^T\) (and vice versa).

Matrix multiplication

  • A column vector of length \(K\) is a matrix with one column and \(K\) rows.
  • A row vector of length \(K\) is a matrix with one row and \(K\) columns.

When we multiply a matrix \(\mathbb{B}\) with \(K\) columns by a column vector \(\mathbf{D}\) of length \(K\), we get a column vector with one entry per row of \(\mathbb{B}\):
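For example, multiplying the \(3 \times 2\) matrix \(\mathbb{B}\) from above by the column vector \(\mathbf{D} = (2, 1)^T\) yields a column vector of length 3:

\[\mathbb{B}\mathbf{D} = \begin{bmatrix} 1 & 2 \\ 4 & 0 \\ -1 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1(2) + 2(1) \\ 4(2) + 0(1) \\ -1(2) + 3(1) \end{bmatrix} = \begin{bmatrix} 4 \\ 8 \\ 1 \end{bmatrix}\]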

Matrix multiplication more generally
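In general, an \(m \times k\) matrix \(\mathbb{A}\) times a \(k \times p\) matrix \(\mathbb{B}\) gives an \(m \times p\) matrix, whose entry in row \(i\), column \(j\) is the \(i\)th row of \(\mathbb{A}\) times the \(j\)th column of \(\mathbb{B}\):

\[(\mathbb{A}\mathbb{B})_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}\]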

Matrix manipulation in R

# equivalent ways to make B:
B <- cbind(c(1, 4, -1), c(2, 0, 3))
(B <- matrix(data = c(1, 4, -1, 2, 0, 3), ncol = 2))
     [,1] [,2]
[1,]    1    2
[2,]    4    0
[3,]   -1    3
t(B)
     [,1] [,2] [,3]
[1,]    1    4   -1
[2,]    2    0    3
# also equivalent:
D <- c(2, 1)
(D <- matrix(data = c(2, 1), ncol = 1))
     [,1]
[1,]    2
[2,]    1
B %*% D
     [,1]
[1,]    4
[2,]    8
[3,]    1

Can only multiply \(\mathbb{B}\) and \(\mathbf{D}\) if number of columns in \(\mathbb{B}\) equals number of rows in \(\mathbf{D}\).

Our OLS problem in matrix algebra

Recall our system of equations:

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

We can write this in matrix form as

\[ \mathbb{B} \boldsymbol{\beta} = \mathbf{D} \]

where

\[\mathbb{B} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}, \, \, \, \,\, \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \, \, \, \,\, \mathbf{D} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]
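Multiplying out confirms that this matrix form reproduces the system:

\[\mathbb{B} \boldsymbol{\beta} = \begin{bmatrix} n \beta_0 + \beta_1 \sum X_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} = \mathbf{D}\]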

The inverse: concept

The inverse of a scalar (number) \(a\) is written \(a^{-1}\), and \(a \times a^{-1} = 1\). (e.g. \(3^{-1} = 1/3\), so \(3 \times 1/3 = 1\).)

We multiply both sides by the inverse of \(a\) when we solve for \(x\):

\[\begin{align} a x &= b \\ x &= b/a \end{align}\]


The inverse of a matrix \(\mathbb{B}\) is written \(\mathbb{B}^{-1}\).

We can similarly solve our problem using the inverse of \(\mathbb{B}\):

\[\begin{align} \mathbb{B} \boldsymbol{\beta} &= \mathbf{D} \\ \boldsymbol{\beta} &= \mathbb{B}^{-1} \mathbf{D} \end{align}\]

This is a useful way to represent the solution to a system of equations.
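For small matrices there is a closed form. If \(\mathbb{B} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\) with \(ad - bc \neq 0\), then

\[\mathbb{B}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}.\]

When \(ad - bc = 0\) the inverse does not exist; in our bivariate problem \(ad - bc = n\sum X_i^2 - \left(\sum X_i\right)^2\), which is zero only when all the \(X_i\) are equal.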

The inverse: implementation

There are manual procedures for inverting matrices. We use R’s solve() function:

# creating our matrices for one toy dataset
x <- c(1, 2, 3)
y <- c(1, 3, 3)
(B <- cbind(c(length(x), sum(x)), c(sum(x), sum(x^2))))
     [,1] [,2]
[1,]    3    6
[2,]    6   14
(D <- c(sum(y), sum(x*y)))
[1]  7 16
# getting the inverse:
solve(B)
          [,1] [,2]
[1,]  2.333333 -1.0
[2,] -1.000000  0.5
# confirming it does what it should
solve(B) %*% B
     [,1] [,2]
[1,]    1    0
[2,]    0    1

The inverse: implementation (2)

Recall our toy dataset, and the system of equations \(\mathbb{B} \boldsymbol{\beta} = \mathbf{D}\) that represents the first order conditions for the vector \(\boldsymbol{\beta}\) that minimizes the sum of squared residuals:

# creating our matrices for one toy dataset
x <- c(1, 2, 3)
y <- c(1, 3, 3)
B <- cbind(c(length(x), sum(x)), c(sum(x), sum(x^2)))
D <- c(sum(y), sum(x*y))

Using solve(), we compute \(\hat{\beta} = \mathbb{B}^{-1}\mathbf{D}\) to solve the system of equations:

solve(B) %*% D
          [,1]
[1,] 0.3333333
[2,] 1.0000000
# checking the solution
lm(y ~ x) |> coef() 
(Intercept)           x 
  0.3333333   1.0000000 

So we have (1) a way to represent systems of equations and (2) a way to solve them.

The remaining bottleneck: getting the system of equations from the data (expanding the sum of squared residuals, taking derivatives, etc.).

The general OLS estimator in matrix form

OLS with \(K\) predictors

Let \(\mathbb{X}\) be the \(n \times (K + 1)\) regressor matrix (a column of 1s and one column per predictor):

\[\mathbb{X} = \begin{bmatrix} 1 & X_{[1]1} & X_{[2]1} & \cdots & X_{[K]1} \\ 1 & X_{[1]2} & X_{[2]2} & \cdots& X_{[K]2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{[1]n} & X_{[2]n} & \cdots & X_{[K]n} \end{bmatrix}\]

Let \(\boldsymbol{\beta}\) be the \(K+1\)-length column vector of coefficients: \[\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{K} \end{bmatrix}\]

Let \(\mathbf{Y}\) be the \(n\)-length column vector of outcomes: \[\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{n} \end{bmatrix}\]
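With these definitions, the \(i\)th entry of \(\mathbb{X}\boldsymbol{\beta}\) is the predicted value for observation \(i\):

\[(\mathbb{X}\boldsymbol{\beta})_i = \beta_0 + \beta_1 X_{[1]i} + \cdots + \beta_K X_{[K]i}\]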

OLS with \(K\) predictors (2)

For a given coefficient vector \(\boldsymbol{\beta}\),

  • \(\mathbb{X}\boldsymbol{\beta}\) is a column vector of predictions
  • \(\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}\) is a column vector of residuals
  • \((\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})\) is the sum of squared residuals

OLS estimator in matrix form:

\[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{\mathbf{b} \in \mathbb{R}^{K+1}}{\arg\min} \, (\mathbf{Y} - \mathbb{X}\mathbf{b})^T (\mathbf{Y} - \mathbb{X}\mathbf{b}),\]

where \(\mathbf{b} = (b_0, b_1, \ldots, b_K)^T\).

Now we do minimization via differentiation for this version of the problem.

Minimization via differentiation (again)

Expand the sum:

\[ (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) = \mathbf{Y}^T \mathbf{Y} - \mathbf{Y}^T \mathbb{X} \boldsymbol{\beta} - \boldsymbol{\beta}^T \mathbb{X}^T \mathbf{Y} + \boldsymbol{\beta}^T\mathbb{X}^T \mathbb{X} \boldsymbol{\beta}\]
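Since \(\mathbf{Y}^T \mathbb{X} \boldsymbol{\beta}\) is a scalar, it equals its own transpose \(\boldsymbol{\beta}^T \mathbb{X}^T \mathbf{Y}\), so the middle terms combine:

\[(\mathbf{Y} - \mathbb{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbb{X}\boldsymbol{\beta}) = \mathbf{Y}^T \mathbf{Y} - 2\boldsymbol{\beta}^T \mathbb{X}^T \mathbf{Y} + \boldsymbol{\beta}^T \mathbb{X}^T \mathbb{X} \boldsymbol{\beta}\]

Differentiating term by term uses the matrix-calculus identities \(\frac{\partial}{\partial \boldsymbol{\beta}} \boldsymbol{\beta}^T \mathbf{a} = \mathbf{a}\) and \(\frac{\partial}{\partial \boldsymbol{\beta}} \boldsymbol{\beta}^T \mathbb{A} \boldsymbol{\beta} = 2\mathbb{A}\boldsymbol{\beta}\) for symmetric \(\mathbb{A}\) (here \(\mathbb{A} = \mathbb{X}^T\mathbb{X}\)).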

Differentiating with respect to \(\boldsymbol{\beta}\) gives us the first order condition(s), \[2 \mathbb{X}^T \mathbb{X}\boldsymbol{\beta} - 2 \mathbb{X}^T \mathbf{Y} = 0,\]

which becomes the system of equations \[\mathbb{X}^T \mathbb{X}\boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y} \]

with solution \[ \boldsymbol{\beta} = \left(\mathbb{X}^T \mathbb{X}\right)^{-1} \mathbb{X}^T \mathbf{Y}.\]

Back to our single-predictor case

Our system of equations was:

\[\begin{align} n \beta_0 + \beta_1 \sum X_i &= \sum Y_i \\ \beta_0 \sum X_i + \beta_1 \sum X_i^2 &= \sum X_i Y_i \end{align}\]

In matrix form:

\[\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}\]

This is \(\mathbb{X}^T \mathbb{X} \boldsymbol{\beta} = \mathbb{X}^T \mathbf{Y}\):

\[\begin{align}\mathbb{X}^T \mathbb{X} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}\\ &= \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2\end{bmatrix} \end{align}\]

\[\begin{align}\mathbb{X}^T \mathbf{Y} &= \begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \\ &= \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} \end{align}\]

Illustration with CCES data

dat <- read.csv("./../data/cces_2012_subset.csv")
use <- !is.na(dat$env) & !is.na(dat$gender) & !is.na(dat$aa)
X <- cbind(1, dat$gender[use], dat$aa[use])
head(X, 3)
     [,1] [,2] [,3]
[1,]    1    1    4
[2,]    1    2    1
[3,]    1    2    4
Y <- dat$env[use]
head(Y, 3)
[1] 4 1 5
solve(t(X) %*% X) %*% t(X) %*% Y
           [,1]
[1,] 1.14259962
[2,] 0.07957051
[3,] 0.65731827
lm(env ~ gender + aa, data = dat)

Call:
lm(formula = env ~ gender + aa, data = dat)

Coefficients:
(Intercept)       gender           aa  
    1.14260      0.07957      0.65732  

Summing up

Most important to know about OLS:

  • OLS (minimizing sum/mean of squared residuals in sample) is an estimator of the BLP
  • you decide which BLP you want to approximate

Today we have added some intuition about how OLS is estimated:

  • we minimize the sum/mean of squared residuals using differentiation
  • the first order conditions constitute a system of equations, one for each coefficient
  • we can solve that system by isolating and substituting
  • or we can use matrix algebra: \(\boldsymbol{\beta} = (\mathbb{X}^T\mathbb{X})^{-1} \mathbb{X}^T \mathbf{Y}\)

More resources

For a walk-through of the optimization part of matrix OLS, see Ben Lambert’s econometrics videos.

Also see his video on the variance of the OLS estimator given homoskedasticity.

Overall takeaways

Goals for course

Our goal in this class was to “understand topics every social scientist needs to know”:

  • Linear regression (OLS)
  • Statistical inference (standard errors, p-values, null hypothesis)

Review list of (mis)conceptions in Lecture 0!

Goals for course (2)

Quant methods is about clarifying what can be learned from observing and counting things in the world.

This can be surprisingly difficult! But we hope you

  • see why it is important to do this correctly, and
  • recognize that with effort you can understand any area (seriously!) of applied methods, and
  • even enjoy some part of that process?

Remember: often data collection is most important, and with great data we usually don’t need fancy methods.